Textual Data Formats

Let's learn about the common encoding styles used in modern API architectures.

Introduction#

Web APIs communicate over the network, so data must be serialized before it's transmitted. The following diagram shows the typical data encoding workflow of a server sending its response to the remote client and vice versa.

Data serialization in API communication

We can serialize data using methods built into a language, such as Marshal in Ruby, pickle in Python, Serializable in Java, and so on. External libraries can also be used for this purpose. However, these approaches share some common problems. They are language-specific and can tightly couple an API to that language. Furthermore, they lack efficient versioning mechanisms and are generally inefficient in terms of CPU utilization and encoded data size. Therefore, developers prefer standardized encoding formats, such as XML, JSON, Protobuf, and Avro, over built-in encoding methods.
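To make the contrast concrete, here is a minimal sketch that encodes the same object with Python's built-in pickle module and with JSON. The payload and its field names are hypothetical and chosen only for illustration.

```python
import json
import pickle

# Hypothetical payload used only for illustration.
payload = {"chatid": 123, "author": "John Snow", "text": "Hello"}

# Language-specific encoding: the result is an opaque byte string that
# Python's pickle module is designed to read back.
pickled = pickle.dumps(payload)

# Standardized textual encoding: any language with a JSON parser can read it.
encoded = json.dumps(payload)

print(type(pickled).__name__)  # bytes
print(encoded)

# Both encodings round-trip to an equal object within Python.
assert pickle.loads(pickled) == payload
assert json.loads(encoded) == payload
```

A client written in another language could parse the JSON string directly, whereas the pickled bytes would tie it to Python's object model.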

Textual representation formats#

JSON, XML, and CSV are data representation formats that use Unicode characters (text) and support encoding in many languages. The most popular way for clients to interact with web APIs is by exchanging data serialized as XML or JSON. CSV is less powerful than XML and JSON because it doesn't support hierarchical data, and it's commonly used for data in tabular form. These formats have a standardized encoding style, so clients can easily extract and consume data by deserializing it into structures native to their own language. Text-based representation formats have the following common characteristics:

  • Schemaless structure: Data-related information (metadata) is embedded within the format structure.

  • Machine-readable: The data is well structured and can be easily interpreted by machines.

  • Human-readable: Data is represented in Unicode characters that humans can read and understand.

  • Object representation: Data can embed programming objects in a defined format.
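The difference between flat and hierarchical formats can be sketched with Python's standard csv and json modules. The records and the surrounding chat object are hypothetical and serve only as an illustration.

```python
import csv
import io
import json

# Hypothetical records used only for illustration.
rows = [
    {"author": "John Snow", "text": "Hello"},
    {"author": "Arya Stark", "text": "Hi"},
]

# CSV: one flat row per record, with column names in the header line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["author", "text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# JSON: the same records can be nested inside other objects.
json_text = json.dumps({"chat": {"id": 123, "messages": rows}})

print(csv_text)
print(json_text)
```

In both outputs the field names travel with the data, which is what the schemaless characteristic refers to.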

In the following sections, we’ll discuss XML and JSON, two well-known text data formats for developing APIs.

XML#

Extensible Markup Language (XML) is a restricted form (subset) of Standard Generalized Markup Language (SGML) and is designed for storing and exchanging data on the web. XML has a syntax similar to that of HTML. It uses tags to construct elements, but unlike HTML, the tags are not predefined. This means that users can create custom tags to define elements. Generally, tags are used as keys to identify data elements, and they can also carry attributes that define metadata, which helps with tasks such as filtering and sorting data. An element's value is stored between its opening and closing tags. The illustrations below describe the structure of an XML message, along with an example.

Structure of an XML message
<?xml version="1.0" encoding="UTF-8"?>
<message chatid="123" lang="en">
  <head category="private-chat">
    <warning>Some alerts</warning>
  </head>
  <body disablelinks="true">
    <author>John Snow</author>
    <text>Hello from XML format</text>
  </body>
</message>

Explanation#

The <?xml ?> declaration at the top of the XML document above specifies the XML version and the character encoding used in the document, and it tells applications how to process the document. XML follows a tree-based structure, and the message tag represents the root element of the document. The head and body tags are children of the root element message. In turn, warning is a child of head, and author and text are children of body.
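The tree structure described here can be inspected with a few lines of Python. This is a minimal sketch that uses the standard library's xml.etree.ElementTree module; the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The example message from this lesson, passed as bytes so that the
# parser handles the encoding named in the declaration.
xml_message = b"""<?xml version="1.0" encoding="UTF-8"?>
<message chatid="123" lang="en">
  <head category="private-chat">
    <warning>Some alerts</warning>
  </head>
  <body disablelinks="true">
    <author>John Snow</author>
    <text>Hello from XML format</text>
  </body>
</message>"""

root = ET.fromstring(xml_message)

print(root.tag)                       # message
print(root.attrib)                    # {'chatid': '123', 'lang': 'en'}
print([child.tag for child in root])  # ['head', 'body']

body = root.find("body")
print([child.tag for child in body])  # ['author', 'text']
print(body.find("author").text)       # John Snow
```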

Note: To read more on the syntax and semantics of XML, see the W3C documentation.

Advantages#

XML provides a robust and extensible structural representation by using namespaces to define and reuse data-related information in a standardized way. It supports features like XPath and XQuery. It also supports adding new entities over time without affecting the functionality of existing ones. XML is highly portable and is being used in a variety of different domains, described below:

  • XML is widely used in electronic data interchange to exchange information quickly and securely between different systems. For example, SOAP uses XML for secure business-to-business data exchange (it encapsulates sensitive content in an XML envelope), while WordPress supports XML-RPC to communicate with other systems.

  • XML offers streamlined methods for accessing information (XPath, XQuery, etc.), and its search results are precise. Because of this simplicity and efficiency in obtaining information, XML is popular in the web automation industry for building crawlers and web bots.

  • XML supports the storage and exchange of complex data in a standardized way. For example, it’s used in the wireless communication industry in applications such as VoiceWeb and personal digital assistants.
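As a rough illustration of path-based access, the sketch below queries the lesson's example message. It assumes Python's xml.etree.ElementTree module, which implements only a limited subset of XPath; full XPath and XQuery support requires other tools.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<message chatid="123" lang="en">'
    '<head category="private-chat"><warning>Some alerts</warning></head>'
    '<body disablelinks="true">'
    '<author>John Snow</author><text>Hello from XML format</text>'
    '</body>'
    '</message>'
)

# Find the author element anywhere below the root.
author = doc.find(".//author").text

# Select the text child of a body element whose attribute matches.
text = doc.find("./body[@disablelinks='true']/text").text

# Collect the tags of all elements that carry a category attribute.
tagged = [el.tag for el in doc.findall(".//*[@category]")]

print(author)  # John Snow
print(text)    # Hello from XML format
print(tagged)  # ['head']
```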

Limitations#

While XML is widespread and has many benefits, it also has some disadvantages when it comes to choosing a data format in API design:

  • Repeated declaration of tags makes it verbose and redundant.

  • It has a large file size due to the number of tags needed to maintain the document structure.

  • XML lacks a built-in mechanism for distinguishing numbers from their string representation.

  • It lacks support for binary strings (raw bytes of data without character encoding).

  • Although XML supports optional schema definitions, applications tend to hard-code their logic for interpreting data.
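The number-versus-string limitation can be seen in a short sketch. The item, count, and code elements below are hypothetical, and the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical elements used only for illustration.
item = ET.fromstring("<item><count>42</count><code>42</code></item>")

count_text = item.find("count").text
code_text = item.find("code").text

# The parser returns both values as text; nothing in the markup says
# that one is a quantity and the other is an identifier.
print(type(count_text).__name__, type(code_text).__name__)  # str str

# Without a schema, the application hard-codes which fields to convert.
count = int(count_text)
code = code_text
```

This is the kind of hard-coded interpretation logic the last bullet refers to.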

JSON#

JavaScript Object Notation (JSON) is a data format derived from a subset of JavaScript syntax and is well known for its built-in browser support. It works well for both client-side and server-side code, with the added benefit of simplicity and readability. It uses colon-separated key-value pairs to describe data attributes. JSON supports four primitive data types (strings, numbers, booleans, and null) and two structured types (objects and arrays). The data is structured using the following six characters:

  • Left curly bracket ({): Marks the start of a data object

  • Right curly bracket (}): Marks the end of a data object

  • Left square bracket ([): Indicates the start of an array

  • Right square bracket (]): Indicates the end of an array

  • Colon (:): Separates the key/name from the value

  • Comma (,): Separates key-value pairs in an object, or elements in an array, from each other

Here is the JSON representation of the same message object that was discussed in the earlier section on XML format:

JSON representation of the message object

JSON allows nested arrays and objects. In the code snippet above, the message is an object, and its members head and body are also objects that are nested inside it. Every key is a string, and each value must be a string, number, boolean, array, object, or null.
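As a minimal sketch, the message object could be built and round-tripped in Python as follows. Mapping both XML attributes and child elements to plain keys is an assumption made for this sketch, not a fixed rule for converting XML to JSON.

```python
import json

# One possible mapping of the XML message: attributes and child
# elements both become keys (an assumption made for this sketch).
message = {
    "message": {
        "chatid": "123",
        "lang": "en",
        "head": {"category": "private-chat", "warning": "Some alerts"},
        "body": {
            "disablelinks": True,
            "author": "John Snow",
            "text": "Hello from XML format",
        },
    }
}

# Serialize to JSON text, then parse it back into Python objects.
encoded = json.dumps(message, indent=2)
print(encoded)

decoded = json.loads(encoded)

# head and body come back as nested objects (Python dicts).
print(type(decoded["message"]["head"]).__name__)  # dict
```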

[Figure: a JSON key-value pair, with labels key, value, string, array, object, true, false, and null]

Note: To read more on the JSON data format, see RFC 8259.

Advantages#

The goal of JSON is to be structurally minimal while being a portable and flexible subset of JavaScript. It has the following advantages over textual data format competitors:

  • Language-independent and has a gentler learning curve than XML.

  • Gaining more and more attention with the increasing popularity of Node.js and JavaScript in web and mobile applications.

  • Lightweight and parses faster than XML.

  • More compact than XML, so less data is sent over the network and transfer latency is lower.
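The compactness claim can be checked informally on the lesson's message. The sketch below compares the character counts of whitespace-free XML and JSON encodings. The JSON mapping is the same assumption as before (attributes become keys), and one sample message is an illustration rather than a benchmark.

```python
import json

# The lesson's message as XML, without indentation or line breaks.
xml_text = (
    '<message chatid="123" lang="en">'
    '<head category="private-chat"><warning>Some alerts</warning></head>'
    '<body disablelinks="true">'
    '<author>John Snow</author><text>Hello from XML format</text>'
    '</body>'
    '</message>'
)

# The same data as a Python dict (attributes mapped to keys by assumption).
message = {
    "message": {
        "chatid": "123",
        "lang": "en",
        "head": {"category": "private-chat", "warning": "Some alerts"},
        "body": {
            "disablelinks": True,
            "author": "John Snow",
            "text": "Hello from XML format",
        },
    }
}

# separators=(",", ":") removes the optional spaces from the output.
json_text = json.dumps(message, separators=(",", ":"))

print(len(xml_text), len(json_text))
```

The XML version is longer mainly because every element name appears twice, once in the opening tag and once in the closing tag.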

Limitations#

JSON is still on its way to maturity in comparison to XML, and the following issues have yet to be addressed:

  • Lacks a standardized way to define and reuse common descriptions for JSON documents.

  • Lacks a built-in way to distinguish between integers and floating-point numbers in a JSON document.

  • Does not guarantee integer precision beyond 2^53. Many parsers store every number as a 64-bit floating-point value, so larger integers can be silently rounded or cause parsing errors.

  • Despite JSON supporting schema definition, the use of schema is not as widespread, and hard-coded interpretation logic for data such as numbers and binary strings can cause compatibility issues.
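The two number-related limitations can be demonstrated with a small sketch. Python's json module keeps integers exact by default, so the parse_int=float argument is used here to imitate a parser that stores every number as a 64-bit float; that imitation is an assumption of this sketch.

```python
import json

big = 2**53 + 1  # 9007199254740993

# Python's json module keeps integers exact by default.
exact = json.loads(str(big))

# Imitate a parser that stores every number as a 64-bit float.
as_double = json.loads(str(big), parse_int=float)

print(exact == big)               # True
print(as_double == float(2**53))  # True: the trailing 1 was lost

# 1 and 1.0 are distinct tokens in the text, but a float-only parser
# returns the same value for both.
same = json.loads("1", parse_int=float) == json.loads("1.0")
print(same)                       # True
```

A common workaround is to send large identifiers as strings, which moves the interpretation logic into the application.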

We can use Base64 encoding to send raw bytes by converting them into a string of ASCII characters. However, this is not the most compact representation because Base64 increases the data size by about 33%.
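Here is a minimal sketch of carrying raw bytes inside a JSON document with Python's standard base64 module. The payload and the filename and data keys are hypothetical.

```python
import base64
import json

# Hypothetical binary payload: 30 raw bytes.
raw = bytes(range(30))

# Base64 turns every 3 bytes into 4 ASCII characters.
b64_text = base64.b64encode(raw).decode("ascii")

# The encoded text can now be placed in a JSON string value.
document = json.dumps({"filename": "blob.bin", "data": b64_text})

# The receiver reverses both steps to recover the original bytes.
restored = base64.b64decode(json.loads(document)["data"])

print(len(raw), len(b64_text))  # 30 40
assert restored == raw
```

The 30-byte payload grows to 40 characters, which matches the roughly one-third overhead of Base64.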

Let's see how these data formats map to our general criteria for an optimal data format.

Feature Support Comparison

| Feature | XML | JSON |
| --- | --- | --- |
| Human-readable | Yes, but requires more effort compared to JSON | Yes, key-value pairs are easier to read than XML |
| Latency | High, due to larger size | Low, because it’s compact |
| Standardized | Supports partial standardization using namespace and optional schema | Developers don't use keys in a standardized way and the use of optional schema is not common |
| Machine friendly | No | No |
| Interoperable | Being used in multiple domains and can deal with a variety of different systems | Popular with JavaScript-based systems but can also be used with other systems |
| Flexible | Full compatibility (both forward and backward compatible) | Full compatibility |

Quiz

When should we prefer XML over JSON?

Both XML and JSON have similar properties. However, XML may be preferable to JSON when the data contains complex structures, such as those found in books, articles, and other datasets. XML can represent documents containing various data structures because it represents data in a tree-like format, making it easier to manage different data structures in an organized manner. It also supports custom attributes and provides methods such as XPath and XQuery, which are suitable for large metadata sets representing complex structures.
